Link-Based Similarity Search to Fight Web Spam
Authors
Abstract
We investigate the usability of similarity search in fighting Web spam, based on the assumption that an unknown spam page is more similar to certain known spam pages than to honest pages. To be successful, search engine spam never appears in isolation: we observe link farms and alliances built for the sole purpose of manipulating search engine rankings. Their artificial nature and strong internal connectedness, however, have given rise to successful algorithms for identifying search engine spam. One example is trust and distrust propagation, an idea originating in recommender systems and P2P networks, which yields spam classifiers by spreading information along hyperlinks from whitelists and blacklists. While most previous results use PageRank variants for propagation, we form classifiers by investigating the similarity top lists of an unknown page under various measures such as co-citation, companion, nearest neighbors in low-dimensional projections, and SimRank. We test our method on two data sets previously used to evaluate spam filtering algorithms.
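The idea of classifying an unknown page from its similarity top list can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method evaluated in the paper: it uses only plain co-citation counts (two pages are similar if many pages link to both), and the function names, the toy graph, the labels, and the cutoff `k` are all hypothetical.

```python
from collections import defaultdict


def cocitation_scores(out_links, query):
    """For every other page, count how many pages link to both it and `query`."""
    scores = defaultdict(int)
    for targets in out_links.values():
        targets = set(targets)  # ignore duplicate links from one source
        if query in targets:
            for other in targets:
                if other != query:
                    scores[other] += 1
    return dict(scores)


def spam_ratio_of_top_list(out_links, labels, query, k=10):
    """Fraction of spam among the k labeled pages most co-cited with `query`.

    Returns None when no labeled page is co-cited with `query`.
    """
    scores = cocitation_scores(out_links, query)
    labeled = [(score, page) for page, score in scores.items() if page in labels]
    # Highest co-citation count first; ties broken by page name for determinism.
    labeled.sort(key=lambda item: (-item[0], item[1]))
    top = labeled[:k]
    if not top:
        return None
    return sum(1 for _, page in top if labels[page] == "spam") / len(top)


# Hypothetical toy graph: source page -> pages it links to.
out_links = {
    "hub1": ["farm_a", "farm_b", "unknown"],
    "hub2": ["farm_a", "unknown"],
    "hub3": ["farm_b", "unknown", "news"],
    "blog": ["news", "wiki"],
}
# Hypothetical labeled sample (the known spam and honest pages).
labels = {"farm_a": "spam", "farm_b": "spam", "news": "honest", "wiki": "honest"}

ratio = spam_ratio_of_top_list(out_links, labels, "unknown", k=2)
```

In this toy graph the two labeled pages most co-cited with `unknown` are both spam, so `ratio` is high. A value like this could be thresholded directly or passed as one feature to a classifier; which of these is appropriate, and how the other measures named in the abstract (companion, low-dimensional projections, SimRank) would replace the co-citation score, is left open here.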
Similar resources
A Perspective of Evolution After Five Years: A Large-Scale Study of Web Spam Evolution
Identifying and detecting web spam is an ongoing battle between spam researchers and spammers, one that has continued from the time search engines first allowed searching of web pages to the modern sharing of web links via social networks. A common challenge faced by spam researchers is that new techniques require a corpus of legitimate and spam web pages. Although large corpora of legit...
Link Spam Detection based on DBSpamClust with Fuzzy C-means Clustering
Search engines have become an omnipresent means of accessing the web. Search engine spamming is the technique of deceiving a search engine's ranking algorithm in order to inflate a page's ranking. Web spammers have taken advantage of the vulnerability of link-based ranking algorithms by creating many artificial references or links in order to acquire a higher-than-deserved ranking in search engines' results. Link b...
Using Rank Propagation and Probabilistic Counting for Link-Based Spam Detection
This paper describes a technique for automating the detection of Web link spam, that is, groups of pages that are linked together with the sole purpose of obtaining an undeservedly high score in search engines. The problem of Web spam is widespread and difficult to solve, mostly due to the large size of web collections that makes many algorithms unfeasible in practice. For spam detection we app...
Link-Based Spam Algorithms in Adversarial Information Retrieval
Web spam has become one of the most exciting challenges and threats to Web search engines. The relationship between search systems and those who try to manipulate them gave rise to the field of adversarial information retrieval. In this paper, we have set up several experiments comparing HostRank and TrustRank to show how effective TrustRank is at combating Web spam and we have also re...
Link Spam Alliances
Link spam is used to increase the ranking of certain target web pages by misleading the connectivity-based ranking algorithms in search engines. In this paper we study how web pages can be interconnected in a spam farm in order to optimize rankings. We also study alliances, that is, interconnections of spam farms. Our results identify the optimal structures and quantify the potential gains. In ...